Task 1

Let $f(x_1, x_2) = (x_1 + x_2)^2$ and assume that $x_1, x_2 \sim U[-1,1]$ and $x_1 = x_2$ (full dependency). Calculate the PD, ME, and AL profiles for variable $x_1$ in this model.

$$ \begin{align} g^{PD}_{x_1}(z) &= E_{x_2}[f(z, x_2)] = \\ &= z^2 + 2zE_{x_2}[x_2] + E_{x_2}[x_2^2] = \\ &= z^2 + 2z \int_{-1}^{1} \frac{x_2}{1-(-1)} \,dx_2 + \int_{-1}^{1}\frac{x_2^2}{1-(-1)} \,dx_2 = \\ &= z^2 + 2z\frac{x_2^2}{4}\Big|_{-1}^1 + \frac{x_2^3}{6}\Big|_{-1}^1 = z^2 + \frac{z(1 - 1)}{2} + \frac{1 - (-1)}{6}\\ &= z^2 + \frac{1}{3} \end{align} $$

$$ \begin{align} g^{ME}_{x_1}(z) &= E_{x_2}[f(z, x_2)\;|\;x_1=z] = \\ &= E_{x_2}[(z+x_2)^2\;|\; x_1 = z] = \\ & \Big\{ x_1=x_2 \Big\} \\ &= E_{x_2}[(2z)^2] =\\ &= 4z^2 \end{align} $$

$$ \begin{align} g^{AL}_{x_1}(z) &= \int_{-1}^z E\Big[ \frac{\partial f(v,x_2)}{\partial v}\;|\; x_1 = v\Big] \,dv = \\ &= \int_{-1}^z E\Big[ \frac{\partial\;v^2 + x_2^2 + 2vx_2}{\partial v}\;|\; x_1 = v\Big] \,dv = \\ &= \int_{-1}^z E\Big[ 2v + 2x_2\;|\; x_1 = v\Big] \,dv = \\ & \Big\{ x_1=x_2 \Big\} \\ &= \int_{-1}^z E[ 4v] \,dv = \\ &= 4\int_{-1}^z v \,dv = \\ &= 4\frac{v^2}{2} \Big|_{-1}^z =\\ &= 2z^2 - 2(-1)^2 =\\ &= 2(z^2-1) \end{align} $$
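The three closed-form profiles above can be sanity-checked numerically. The sketch below (my own verification, not part of the assignment) estimates PD by averaging over an independent sample of $x_2$, while ME and AL use the full dependency $x_2 = x_1$:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x1, x2: (x1 + x2) ** 2

z = 0.5
x2 = rng.uniform(-1, 1, 100_000)  # marginal sample of x2

# PD: average f(z, x2) over the marginal distribution of x2 -> z**2 + 1/3
pd_hat = f(z, x2).mean()

# ME: conditioning on x1 = z forces x2 = z, so f = (2z)**2 = 4z**2
me_hat = f(z, z)

# AL: integrate the conditional expected derivative E[2v + 2*x2 | x1 = v] = 4v
# from -1 to z with the trapezoidal rule -> 2*(z**2 - 1)
vs = np.linspace(-1.0, z, 10_001)
dv = vs[1] - vs[0]
al_hat = ((4 * vs)[:-1] + (4 * vs)[1:]).sum() / 2 * dv

print(pd_hat, me_hat, al_hat)
```

For $z = 0.5$ the estimates should land near $0.583$, $1$, and $-1.5$, matching $z^2 + \tfrac{1}{3}$, $4z^2$, and $2(z^2-1)$.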

Task 2

0. For the selected data set, train at least one tree-based ensemble model, e.g. random forest, gbdt, xgboost.

For this task I reused my models from homework 1 (with default hyperparameter values, which turned out to make them learn slightly better). Below is the performance of the random forest classifier.

In [3]:
explainer.model_performance()
Out[3]:
recall precision f1 accuracy auc
RandomForestClassifier 1.0 1.0 1.0 1.0 1.0

1. Calculate the predictions for some selected observations.

I chose two observations with different associated TARGET values. Apart from presenting the exact prediction, I also display the probabilities with which the model leaned towards class 0 or 1. Results are obtained using the random forest classifier.

In [4]:
show_selected_predictions()
Out[4]:
Feature 0 Feature 1 Feature 2 Feature 3 Feature 4 Feature 5 Feature 6 Feature 7 Feature 8 Feature 9 TARGET pred 0 pred 1
46 6.2 0.450 0.26 4.4 0.063 63.0 206.0 0.9940 3.27 0.52 1 0.30 0.70
147 6.4 0.595 0.14 5.2 0.058 15.0 97.0 0.9951 3.38 0.36 1 0.27 0.73

2. Then, calculate the what-if explanations of these predictions using Ceteris Paribus profiles

I will display results for both observations on the same figures for better readability. I chose the same observations and model as above.

In [5]:
cp = explainer.predict_profile(new_observation=selected_observations_X)
cp.plot()
Calculating ceteris paribus: 100%|██████████| 10/10 [00:00<00:00, 210.45it/s]
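The same what-if profiles can be computed by hand: fix the observation, vary one feature over a grid, and record the prediction at each grid point. A minimal sketch of that idea with a toy model (the `predict` function, feature names, and grid are illustrative, not the notebook's fitted model):

```python
import numpy as np
import pandas as pd

def ceteris_paribus(predict, obs: pd.Series, feature: str, grid: np.ndarray) -> pd.DataFrame:
    """Vary `feature` over `grid` while keeping all other features fixed."""
    rows = pd.DataFrame([obs] * len(grid))
    rows[feature] = grid
    return pd.DataFrame({feature: grid, "prediction": predict(rows)})

# toy probability-like model driven by the difference of two features
predict = lambda df: (1 / (1 + np.exp(-(df["a"] - df["b"])))).to_numpy()
obs = pd.Series({"a": 0.0, "b": 1.0})
profile = ceteris_paribus(predict, obs, "a", np.linspace(-3, 3, 7))
print(profile)
```

This is the per-observation curve that `predict_profile` draws for every feature.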

3. Find two observations in the data set, such that they have different CP profiles. For example, model predictions are increasing with age for one observation and decreasing with age for another one.

I carefully chose observations 46 and 147 in point (2) to meet that requirement. We can observe not only different values in these figures, but also dissimilar trends (e.g. the profiles of Feature 0 are symmetric, while Feature 6 has completely different profiles). Although the overall shapes of the figures are similar, they sometimes differ only slightly; for example, the profile of Feature 5 initially dives near 0 (for observation 46), then comes back and aligns with the curve for observation 147.

4. Compare CP, which is a local explanation, with PDP, which is a global explanation

In [13]:
pdp = explainer.model_profile()
pdp.plot(geom="profiles", variables=['Feature 0', 'Feature 1', 'Feature 2', 'Feature 3', 'Feature 4', 'Feature 5', 'Feature 6', 'Feature 7', 'Feature 8', 'Feature 9'])
Calculating ceteris paribus: 100%|██████████| 10/10 [00:02<00:00,  3.71it/s]

The PDP aligns closely with the majority of the CP profiles (apart from a limited number of deviations like the example above). I was able to produce the image only in the browser, so it may not be visible here; however, I managed to download it, and the image is in the newplot.png file.
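The agreement is not a coincidence: by construction, a PD profile is the point-wise average of the CP profiles over the observations, so the PDP can only deviate from the bulk of CP curves where individual observations disagree strongly. A hedged sketch of that relationship with toy data (reusing the Task 1 function $f(x_1,x_2)=(x_1+x_2)^2$, not the notebook's model):

```python
import numpy as np
import pandas as pd

def partial_dependence(predict, X: pd.DataFrame, feature: str, grid: np.ndarray) -> np.ndarray:
    """PD at each grid point = mean prediction after substituting that value for all rows."""
    values = []
    for z in grid:
        Xz = X.copy()
        Xz[feature] = z                       # one CP substitution per observation
        values.append(predict(Xz).mean())     # average the CP profiles point-wise
    return np.array(values)

rng = np.random.default_rng(1)
X = pd.DataFrame({"a": rng.uniform(-1, 1, 200), "b": rng.uniform(-1, 1, 200)})
predict = lambda df: (df["a"] + df["b"]).to_numpy() ** 2   # f from Task 1
pdp_vals = partial_dependence(predict, X, "a", np.array([-1.0, 0.0, 1.0]))
print(pdp_vals)   # roughly z**2 + 1/3 at z = -1, 0, 1
```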

5. Compare PDP between at least two different models.

I am going to compare my two models from HW1: the already-tested random forest and logistic regression.

In [18]:
pdp2 = explainer2.model_profile()
pdp.plot(pdp2)
Calculating ceteris paribus: 100%|██████████| 10/10 [00:00<00:00, 111.38it/s]

What strikes first is the smoothness of the logistic regression profiles. Almost all of them look like interpolations of the more detailed profiles of the random forest. The conclusion suggests itself that although logistic regression was able to assess the contribution of each variable more or less accurately, it could not capture detailed relationships as precisely as the random forest did.
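The smoothness is expected: for logistic regression, the PD profile in a feature is an average of shifted sigmoids, so it is smooth and monotone in that feature. A small illustration (toy weight and offsets, not the fitted model's coefficients):

```python
import numpy as np

sigmoid = lambda t: 1 / (1 + np.exp(-t))
rng = np.random.default_rng(2)
offsets = rng.normal(0, 1, 500)   # per-row contribution of the other features

def pd_logistic(z, w=1.5):
    # PD at z = average over rows of sigmoid(w*z + offset_i)
    return sigmoid(w * z + offsets).mean()

zs = np.linspace(-3, 3, 13)
profile = np.array([pd_logistic(z) for z in zs])
```

A tree ensemble has no such constraint per feature, which is why its profiles can show steps and local wiggles that the logistic model cannot reproduce.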

Appendix

Load libraries and dataset:

In [2]:
import dalex as dx
import matplotlib
import platform
import sklearn
from copy import deepcopy
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score, recall_score, precision_score

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


import warnings
warnings.filterwarnings("ignore")

# https://github.com/adrianstando/imbalanced-benchmarking-set/blob/main/datasets/wine_quality.csv
dataset = pd.read_csv('wine_quality.csv')
y = dataset["TARGET"].apply(lambda x: 1 if x == 1 else 0)
X_tmp = dataset.drop(axis=1, columns=["TARGET", "Unnamed: 0"])

# rename columns 
X = pd.DataFrame()
X_tmp.head()
for i in range(0,10):
    X[f"Feature {i}"] = X_tmp[str(i)]
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Feature 0  4898 non-null   float64
 1   Feature 1  4898 non-null   float64
 2   Feature 2  4898 non-null   float64
 3   Feature 3  4898 non-null   float64
 4   Feature 4  4898 non-null   float64
 5   Feature 5  4898 non-null   float64
 6   Feature 6  4898 non-null   float64
 7   Feature 7  4898 non-null   float64
 8   Feature 8  4898 non-null   float64
 9   Feature 9  4898 non-null   float64
dtypes: float64(10)
memory usage: 382.8 KB

Train models

In [17]:
model = RandomForestClassifier(random_state=22)
model.fit(X.values, y)
pred = model.predict(X)

model2 = LogisticRegression(random_state=22, max_iter=1000)
model2.fit(X.values, y)
pred2 = model2.predict(X)

def fn(model, df):
    return model.predict_proba(df)[:, 1]

explainer = dx.Explainer(model, X, y, predict_function=fn)
explainer2 = dx.Explainer(model2, X, y, predict_function=fn)

def _construct_selected_observations():
    selected_observations_indices = [46, 147]
    result_X = X.iloc[selected_observations_indices]
    result_y = y.iloc[selected_observations_indices]
    pred_y = model.predict_proba(result_X)
    return result_X, result_y, pred_y

selected_observations_X, selected_observations_y, selected_observations_pred = _construct_selected_observations()
    
def show_selected_predictions():
    result = deepcopy(selected_observations_X)
    result["TARGET"] = selected_observations_y
    result["pred 0"] = selected_observations_pred[:, 0]
    result["pred 1"] = selected_observations_pred[:, 1]
    return result
Preparation of a new explainer is initiated

  -> data              : 4898 rows 10 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 4898 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function fn at 0x7f4d9e247be0> will be used
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.0, mean = 0.0377, max = 0.94
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.29, mean = -0.000363, max = 0.46
  -> model_info        : package sklearn

A new explainer has been created!
Preparation of a new explainer is initiated

  -> data              : 4898 rows 10 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 4898 values
  -> model_class       : sklearn.linear_model._logistic.LogisticRegression (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function fn at 0x7f4d9e247be0> will be used
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 2.3e-05, mean = 0.0374, max = 0.774
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.591, mean = -2.98e-06, max = 1.0
  -> model_info        : package sklearn

A new explainer has been created!